[SPARK-13790] Speed up ColumnVector's getDecimal #11624
Conversation
We should reuse an object similar to the other non-primitive type getters. For a query that computes averages over decimal columns, this shows a 10% speedup on overall query times.

```
TPCDS Snappy:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master)             10627 / 11057         10.8          92.3
q27-agg (this patch)           9722 /  9832         11.8          84.4
```
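To illustrate the getter pattern the patch describes, here is a minimal, Spark-independent sketch: instead of allocating a new wrapper on every call, the getter fills and returns one preallocated mutable instance. The class names `MutableDecimal` and `DecimalColumn` are hypothetical, not Spark's actual API.

```java
// Sketch of a reusable-object getter: one mutable instance is filled and
// returned on every call, avoiding a per-row allocation.

final class MutableDecimal {
    long unscaled;
    int scale;

    MutableDecimal set(long unscaled, int scale) {
        this.unscaled = unscaled;
        this.scale = scale;
        return this;
    }

    double toDouble() {
        return unscaled / Math.pow(10, scale);
    }
}

final class DecimalColumn {
    private final long[] unscaledValues;
    private final int scale;
    // Reused across calls; callers must copy if they hold a reference.
    private final MutableDecimal result = new MutableDecimal();

    DecimalColumn(long[] unscaledValues, int scale) {
        this.unscaledValues = unscaledValues;
        this.scale = scale;
    }

    MutableDecimal getDecimal(int rowId) {
        return result.set(unscaledValues[rowId], scale);
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        DecimalColumn col = new DecimalColumn(new long[]{12345, 678}, 2);
        // The same object is returned for every row; only its contents change.
        MutableDecimal a = col.getDecimal(0);
        double first = a.toDouble();
        MutableDecimal b = col.getDecimal(1);
        System.out.println(first + " " + b.toDouble() + " " + (a == b));
    }
}
```

The trade-off, raised in the discussion below, is that any caller that retains the returned reference across rows will silently observe later values.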
Test build #52800 has finished for PR 11624 at commit
cc @davies
Right now, it's not safe to reuse the objects from the reader or UnsafeRow, because some expressions may hold a reference to the object (for example, an aggregate without a grouping key, and some string functions). That's why we know the cost of creating a new object every time you access a UTF8String/Decimal/Array/MapArray/Struct, but have not optimized it yet. I tried this patch locally: I generated a Parquet file with one decimal column, then read it back and aggregated with max(d) and min(d); min(d) returned a wrong result:
t1 is the table before saving as a Parquet file; t2 is the table loaded from the Parquet file. In order to have this optimization, we need to prove that we always make a copy before holding a reference to an object that could be reused. There are still some places where we use MutableGenericInternalRow; we should also copy when updating it. If we only reuse the object for the new Parquet reader but copy in all other places, this may cause a performance regression for other data sources.
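The aliasing hazard described above can be shown with a minimal, Spark-independent sketch: a min aggregate that holds a reference to the reused object instead of copying it sees the value mutate underneath it, exactly the wrong-result symptom reported for min(d). All names here are hypothetical, not Spark's actual classes.

```java
// One preallocated object is reused by every getter call; an aggregate
// that retains the reference without copying reports the wrong minimum.

final class MutableDecimal {
    long unscaled;

    MutableDecimal set(long v) { this.unscaled = v; return this; }

    MutableDecimal copy() { return new MutableDecimal().set(unscaled); }
}

public class AliasingDemo {
    // Shared instance, reused by every getDecimal-style call.
    private static final MutableDecimal shared = new MutableDecimal();

    static MutableDecimal read(long value) {
        return shared.set(value);
    }

    public static void main(String[] args) {
        long[] column = {5, 1, 9};

        // Buggy min: keeps a reference to the shared object.
        MutableDecimal badMin = null;
        for (long v : column) {
            MutableDecimal d = read(v);
            if (badMin == null || d.unscaled < badMin.unscaled) badMin = d;
        }

        // Correct min: copies before holding the reference.
        MutableDecimal goodMin = null;
        for (long v : column) {
            MutableDecimal d = read(v);
            if (goodMin == null || d.unscaled < goodMin.unscaled) goodMin = d.copy();
        }

        // badMin aliases the shared object, so it reports the last value
        // read (9), not the true minimum (1).
        System.out.println(badMin.unscaled + " " + goodMin.unscaled);
    }
}
```

This is why the discussion insists on proving that every consumer copies before holding a reference: with the buggy loop, `badMin` and the getter's scratch object are the same instance, so each new row overwrites the "minimum" seen so far.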
Noted. The object reuse was not the slow part. Here's a variant that doesn't do the expensive checking, and the performance improvement is the same.
Could you also update UnsafeRow to use this new API?
Test build #52847 has finished for PR 11624 at commit
LGTM, merging this into master, thanks!
Test build #52849 has finished for PR 11624 at commit
## What changes were proposed in this pull request?

We should reuse an object similar to the other non-primitive type getters. For a query that computes averages over decimal columns, this shows a 10% speedup on overall query times.

## How was this patch tested?

Existing tests and this benchmark

```
TPCDS Snappy:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master)             10627 / 11057         10.8          92.3
q27-agg (this patch)           9722 /  9832         11.8          84.4
```

Author: Nong Li <nong@databricks.com>

Closes apache#11624 from nongli/spark-13790.